
Implement ConcurrentLfu events #795

Merged
bitfaster merged 26 commits into main from users/alexpeck/lfuevents10 on May 1, 2026
Conversation

@bitfaster (Owner) commented Apr 29, 2026

This PR implements events for ConcurrentLfu as a switchable policy. When disabled, all event code is fully elided at runtime by the JIT compiler.

Events are perf critical in ConcurrentLfu because the ItemRemoved logic is invoked as part of the maintenance cycle, which introduces overhead even when no handlers are registered. The Maintenance method's latency determines cache throughput at the limit, so any overhead here is undesirable. Later, these events could be captured in a list and processed asynchronously via the scheduler.

In this implementation, TryRemove defers event processing to the maintenance cycle (so TryRemove and policy-based eviction behave the same), whereas TryUpdate invokes the event handler synchronously at the call site.
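The deferral described above can be sketched as follows: removals are buffered on the fast path and handlers run later during maintenance. This is a hypothetical illustration, not the PR's actual implementation — the names `DeferredRemovalEvents`, `Defer`, and `Drain` are invented for the sketch:

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical sketch of deferring removal notifications to the maintenance
// cycle; names and structure are illustrative, not the PR's real types.
public sealed class DeferredRemovalEvents<K, V>
{
    private readonly ConcurrentQueue<(K Key, V Value)> pending =
        new ConcurrentQueue<(K Key, V Value)>();

    public event Action<K, V> ItemRemoved;

    // TryRemove path: record the removal, but do not raise the event yet,
    // keeping the remove fast path free of handler latency.
    public void Defer(K key, V value) => pending.Enqueue((key, value));

    // Called from the maintenance cycle: drain the buffer and raise handlers
    // there, so TryRemove and policy-based eviction notify identically.
    public void Drain()
    {
        while (pending.TryDequeue(out var item))
        {
            ItemRemoved?.Invoke(item.Key, item.Value);
        }
    }
}
```

The design choice this illustrates: deferral trades notification immediacy for a cheaper removal path, at the cost of handlers observing removals slightly later.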

Based on this earlier PR: #727

@coveralls commented Apr 29, 2026

Coverage Status

coverage: 99.192% (+0.01%) from 99.181% — users/alexpeck/lfuevents10 into main

@bitfaster (Owner, Author) commented Apr 29, 2026

Analysis performed by Claude; figures are code size in bytes:

Update path (TryUpdateEventInliner.OnUpdatedEvent)

| Method | EventPolicy | NoEventPolicy | Δ |
| --- | --- | --- | --- |
| AddOrUpdate | 897 | 802 | −95 |

Maintenance path (DoMaintenanceEvictEventInliner.OnRemovedEvent)

| Method | EventPolicy | NoEventPolicy | Δ |
| --- | --- | --- | --- |
| AddOrUpdate (drives writes) | 977 | 889 | −88 |
| OnWrite (drains buffered removes) | 1,503 | 1,391 | −112 |
| EvictFromMain (eviction loop) | 1,653 | 1,545 | −108 |
| Evict (single-victim teardown) | 358 | 260 | −98 |
| EventPolicy.OnItemRemoved (standalone) | 112 | not emitted | −112 |
| Maintenance | 2,467 | 2,495 | +28 (layout) |
| DoMaintenance | 1,285 | 1,302 | +17 (layout) |
| AfterWrite, EvictFromWindow, OptimizePartitioning, TryScheduleDrain, ScheduleAfterWrite, AdmitCandidate | identical | identical | 0 |

Assembly evidence

In EventPolicy Evict (and OnWrite):

```asm
mov  rcx, offset MT_BitFaster.Caching.ItemRemovedEventArgs<Int32, Int32>
call CORINFO_HELP_NEWSFAST                ; allocate args
...
call qword ptr [7FFC...]                  ; EventPolicy.OnItemRemoved(Int32, Int32, ItemRemovedReason)
call qword ptr [r14+18]                   ; invoke delegate
```

In NoEventPolicy Evict (and OnWrite): zero matches for ItemRemovedEventArgs, OnItemRemoved, or eventPolicy field loads. The epilogue goes straight from evictedCount++ to ret.

Verdict

EventInliner.IsEnabled = typeof(E) == typeof(EventPolicy<K,V>) is folded as a JIT-time constant per generic instantiation, eliminating the event branch and everything inside it: oldValue capture, delegate field loads, null check, args allocation, invocation.
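The elision pattern can be sketched as follows. The `typeof` comparison mirrors the expression quoted above, but the surrounding type bodies are assumptions for illustration, not the library's real code:

```csharp
using System;

// Illustrative sketch of policy-based event elision; the policy bodies here
// are assumptions, only the IsEnabled expression mirrors the PR's description.
public struct EventPolicy<K, V>
{
    public Action<K, V> ItemRemoved; // delegate field loaded only on the event path
}

public struct NoEventPolicy<K, V> { } // empty marker policy

public static class EventInliner<K, V, E> where E : struct
{
    // For struct generic arguments the JIT compiles a distinct body per
    // instantiation (no canonicalization), so this typeof comparison folds
    // to a JIT-time constant: true for EventPolicy, false for NoEventPolicy.
    public static bool IsEnabled => typeof(E) == typeof(EventPolicy<K, V>);
}

public class CacheCore<K, V, E> where E : struct
{
    private EventPolicy<K, V> events; // only meaningful when E is EventPolicy

    internal void Evict(K key, V value)
    {
        if (EventInliner<K, V, E>.IsEnabled)        // branch removed at JIT time
        {
            events.ItemRemoved?.Invoke(key, value); // field load, null check, and
        }                                           // invocation elided with the branch
    }
}
```

Because the branch condition is constant per instantiation, the NoEventPolicy body compiles as if the guarded block were never written.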

What causes the layout differences that increase code size?

The JIT compiles ConcurrentLfuCore<...EventPolicy> and ConcurrentLfuCore<...NoEventPolicy> as two separate methods
(struct generics → distinct codegen, no canonicalization). Even when the source is identical and no instructions are
added/removed, the emitted byte count can drift by tens of bytes from second-order effects:

  1. Branch encoding. x86 has jmp short (2 bytes, ±127B reach) vs jmp near (5 bytes). When the surrounding code shrinks
    because events were elided, some forward jumps that were near can switch to short — or vice versa. A single such flip
    is ±3 bytes.
  2. Register allocation differences. Registers r8–r15 require a 1-byte REX prefix; rax–rdi don't. Different live-range
    pressure between the two instantiations can shift one variable from rdi to r12, which silently grows every instruction
    touching it.
  3. Basic-block reordering. The JIT orders blocks by edge weight / heuristics. Different inlining decisions in callees
    can change perceived hotness and reorder blocks, which changes which edges are fall-through vs. branch.
  4. Alignment padding. The JIT inserts NOPs ahead of loop heads (often 16-byte alignment). When earlier code shrinks,
    the loop head's natural address shifts, so the padding changes.
  5. Profile counter placement. The tier-1 JIT inserts CORINFO_HELP_COUNTPROFILE32 calls; placement isn't bitwise
    identical across instantiations.

@bitfaster (Owner, Author) commented

End-to-end latency: main vs this branch (NoEventPolicy)

ConcurrentLfu.GetOrAdd hot path (key already present), capacity 9, 1 stripe. main's pre-events ConcurrentLfu was built as BitFaster.Caching.MainBaseline.dll and referenced via extern alias so both implementations coexist in the same benchmark process.
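The extern alias setup can be sketched roughly as follows. The `<Aliases>` reference metadata and `extern alias` directive are standard MSBuild/C#, but the exact project wiring shown here is an assumption, not taken from the PR:

```csharp
// Assumed project wiring for the baseline reference (illustrative):
//   <Reference Include="BitFaster.Caching.MainBaseline">
//     <Aliases>MainBaseline</Aliases>
//   </Reference>
extern alias MainBaseline;

// Both implementations now coexist in one process under distinct aliases:
using CurrentLfu  = BitFaster.Caching.Lfu.ConcurrentLfu<int, int>;
using BaselineLfu = MainBaseline::BitFaster.Caching.Lfu.ConcurrentLfu<int, int>;
```

Without the alias, both assemblies would expose identical fully qualified type names and the compiler could not distinguish them.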

| Scheduler | main ConcurrentLfu | this branch NoEventPolicy | Δ |
| --- | --- | --- | --- |
| BackgroundThread | 27.59 ns | 25.22 ns | −9% (within noise) |
| Foreground | 45.90 ns | 46.94 ns | +2% (within noise) |
| ThreadPool | 26.30 ns | 28.11 ns | +7% (within noise) |
| Null | 12.56 ns | 12.37 ns | −1.5% (parity) |

The Foreground and Null scheduler rows are the cleanest signal, since they avoid scheduler-side interference. The BackgroundThread and ThreadPool rows carry larger variance from scheduler timing and inter-thread coordination, not from code-path differences inside the cache.

@bitfaster bitfaster marked this pull request as ready for review April 29, 2026 23:15
@bitfaster (Owner, Author) commented

Before (main)

| Method | Mean | Error | StdDev | Ratio | Allocated |
| --- | --- | --- | --- | --- | --- |
| ConcurrentDictionary | 3.988 ns | 0.0489 ns | 0.0457 ns | 1.00 | - |
| ConcurrentLfuBackground | 27.655 ns | 0.3675 ns | 0.3438 ns | 6.94 | - |
| ConcurrentLfuForeround | 44.977 ns | 0.7981 ns | 0.7466 ns | 11.28 | - |
| ConcurrentLfuThreadPool | 30.827 ns | 0.4778 ns | 0.4469 ns | 7.73 | - |
| ConcurrentLfuNull | 12.427 ns | 0.1547 ns | 0.1447 ns | 3.12 | - |

After (8d4ee12)

| Method | Mean | Error | StdDev | Ratio | Allocated |
| --- | --- | --- | --- | --- | --- |
| ConcurrentDictionary | 3.522 ns | 0.0252 ns | 0.0223 ns | 1.00 | - |
| ConcurrentLfuBackground | 25.430 ns | 0.0973 ns | 0.0812 ns | 7.22 | - |
| ConcurrentLfuForeround | 46.249 ns | 0.3032 ns | 0.2836 ns | 13.13 | - |
| ConcurrentLfuThreadPool | 15.723 ns | 0.3457 ns | 0.4116 ns | 4.46 | - |
| ConcurrentLfuNull | 12.849 ns | 0.0553 ns | 0.0490 ns | 3.65 | - |

It is not clear why the ThreadPool result was so fast here; it appears to be a one-off and does not repro.

@bitfaster (Owner, Author) commented Apr 30, 2026

Multiple runs (events and no events are the same code):

(Image: Results_Evict_500 benchmark chart)

Review threads (resolved): BitFaster.Caching/Lfu/ConcurrentLfu.cs, BitFaster.Caching/Lfu/ConcurrentLfuCore.cs
Alex Peck added 2 commits April 30, 2026 10:49
@bitfaster bitfaster merged commit 0686f4e into main May 1, 2026
26 of 27 checks passed
@bitfaster bitfaster deleted the users/alexpeck/lfuevents10 branch May 1, 2026 00:25